Subset selection

Author

Jorge Bris Moreno

Assignment by: Dr. Purna Gamage

Instructions

Submission:

Optional:

Code
#knitr::opts_chunk$set(include = FALSE) # for making prompts
#knitr::opts_chunk$set(echo = TRUE) # for making solutions
library(tidyverse)
library(modeldata)
library(leaps)
library(caret)
library(corrplot)
library(ggplot2)
library(viridis)

HW-2.1: Heating values

Use best subset selection methods to determine the best heating values

For bioenergy production, the heating value is a measure of the amount of heat released during combustion. The higher heating value (HHV) is a particular method for determining the heat released during combustion. The higher the HHV, the more energy is released for a given amount of material. You will use the biomass dataset from the modeldata package. Run ?biomass after importing the data to read about the domain. The response variable is HHV and the predictor variables are the percentages of different elements. Do not include the sample and dataset variables in your analysis.

HW-2.1.a Create scatter-plots of the response and predictor variables and comment on your findings.

Code
# IMPORT 
data(biomass)

# EXPLORE DATA
print(class(biomass))
[1] "data.frame"
Code
print(dim(biomass))
[1] 536   8
Code
print(biomass[1:3,1:8])
                  sample  dataset carbon hydrogen oxygen nitrogen sulfur    HHV
1           Akhrot Shell Training  49.81     5.64  42.94     0.41   0.00 20.008
2 Alabama Oak Wood Waste Training  49.50     5.70  41.30     0.20   0.00 19.228
3                  Alder Training  47.82     5.80  46.25     0.11   0.02 18.299
Code
# INSERT CODE 

my_theme <- readRDS('~/.Rthemes/my_theme.rds')

ggplot(biomass, aes(x = carbon, y = HHV)) + 
  geom_point(aes(color = carbon)) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "carbon vs HHV", 
       x = "carbon", 
       y = "HHV") + 
  my_theme()
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(biomass, aes(x = hydrogen, y = HHV)) + 
  geom_point(aes(color = hydrogen)) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "hydrogen vs HHV", 
       x = "hydrogen", 
       y = "HHV") + 
  my_theme()
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(biomass, aes(x = oxygen, y = HHV)) + 
  geom_point(aes(color = oxygen)) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "oxygen vs HHV", 
       x = "oxygen", 
       y = "HHV") + 
  my_theme()
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(biomass, aes(x = nitrogen, y = HHV)) + 
  geom_point(aes(color = nitrogen)) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "nitrogen vs HHV", 
       x = "nitrogen", 
       y = "HHV") + 
  my_theme()
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(biomass, aes(x = sulfur, y = HHV)) + 
  geom_point(aes(color = sulfur)) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "sulfur vs HHV", 
       x = "sulfur", 
       y = "HHV") + 
  my_theme()
`geom_smooth()` using formula = 'y ~ x'

These scatter plots show that carbon has the strongest relationship with HHV, while oxygen and hydrogen also appear moderately related. For nitrogen and sulfur the pattern is hard to tell. We will therefore also check this with a correlation plot for additional visualization.

Code
# Generate a correlation plot
cor_matrix <- cor(biomass[, sapply(biomass, is.numeric)])

corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45, 
         addCoef.col = "black",
         diag = FALSE)

This confirms what we noted above: carbon is the strongest predictor, and oxygen and hydrogen still show moderately strong correlations with HHV.

HW-2.1.b Split the dataset into an 80-20 training and test sets.

Code
# INSERT CODE 
paste("Current training labelled:",(sum(biomass$dataset == "Training"))/nrow(biomass))
[1] "Current training labelled: 0.850746268656716"
Code
paste("Current testing labelled:",(sum(biomass$dataset == "Testing"))/nrow(biomass))
[1] "Current testing labelled: 0.149253731343284"

We will drop the dataset column, since its built-in split does not match the 80-20 proportions we want, and then partition the data ourselves.

Code
# drop dataset column
biomass <- subset(biomass, select = -dataset)
# checking column names
colnames(biomass)
[1] "sample"   "carbon"   "hydrogen" "oxygen"   "nitrogen" "sulfur"   "HHV"     
Code
# Partition data
partition <- createDataPartition(biomass$HHV, p = 0.8, list = FALSE)
training_set <- biomass[partition, ]
testing_set <- biomass[-partition, ]

We will also drop the sample column, as it is a categorical identifier that is unique to each observation.

Code
training_set <- subset(training_set, select = -sample)
testing_set <- subset(testing_set, select = -sample)
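As a sanity check on the 80-20 split, here is a plain base-R version of the same partition (a sketch; unlike createDataPartition() above it does not stratify on HHV, and the row count 536 is taken from dim(biomass)):

```r
# Minimal base-R 80-20 split (no stratification on the response)
set.seed(4444)
n <- 536                                 # rows in biomass
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
length(train_idx)                        # 428 training rows
n - length(train_idx)                    # 108 test rows
```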

HW-2.1.c Use regsubsets() to perform best subset selection to pick the best model according to \(C_p\), BIC, and adjusted \(R^2\).

Code
# INSERT CODE 
set.seed(4444)

result <- regsubsets(HHV ~ ., data = training_set, nvmax = 5, method = "exhaustive")

summary_result <- summary(result)
Code
par(mfrow = c(2, 2))

plot(summary_result$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_result$cp), summary_result$cp[which.min(summary_result$cp)], col = "red", cex = 2, pch = 20)

plot(summary_result$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_result$bic), summary_result$bic[which.min(summary_result$bic)], col = "red", cex = 2, pch = 20)

plot(summary_result$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_result$adjr2), summary_result$adjr2[which.max(summary_result$adjr2)], col = "red", cex = 2, pch = 20)

These graphs suggest that a three-variable model is our best pick.
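For intuition on what the Cp panel measures, Mallows' \(C_p\) can be computed by hand on a small simulated example (a sketch; the data here are synthetic, not the biomass variables):

```r
# Mallows' Cp = RSS_d / sigma2_hat + 2*(d+1) - n, with sigma2_hat taken
# from the full model and d the number of predictors in the submodel
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)             # x2 is pure noise
full <- lm(y ~ x1 + x2)
sub  <- lm(y ~ x1)
sigma2_hat <- sum(resid(full)^2) / (n - 3)
cp <- function(fit, d) sum(resid(fit)^2) / sigma2_hat + 2 * (d + 1) - n
cp(sub, 1)
cp(full, 2)                             # always exactly d + 1 = 3 for the full model
```

By construction the full model's Cp equals \(d + 1\); a submodel is competitive when its Cp is near its own \(d + 1\), which is why the plots look for the minimum.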

HW-2.1.d Repeat this procedure for forward stepwise selection and backward stepwise selection, compare the best models from each selection method.

Code
# INSERT CODE
forward_model <- regsubsets(HHV ~ ., data=training_set, method="forward", nvmax = NULL)

forward_summary <- summary(forward_model)

backward_model <- regsubsets(HHV ~ ., data=training_set, method="backward", nvmax=NULL)

backward_summary <- summary(backward_model)
Code
# INSERT CODE 
# Plot forward
par(mfrow = c(2, 2))

plot(forward_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(forward_summary$cp), forward_summary$cp[which.min(forward_summary$cp)], col = "red", cex = 2, pch = 20)

plot(forward_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(forward_summary$bic), forward_summary$bic[which.min(forward_summary$bic)], col = "red", cex = 2, pch = 20)

plot(forward_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(forward_summary$adjr2), forward_summary$adjr2[which.max(forward_summary$adjr2)], col = "red", cex = 2, pch = 20)

# Plot Backwards
par(mfrow = c(2, 2))

Code
plot(backward_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(backward_summary$cp), backward_summary$cp[which.min(backward_summary$cp)], col = "red", cex = 2, pch = 20)

plot(backward_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(backward_summary$bic), backward_summary$bic[which.min(backward_summary$bic)], col = "red", cex = 2, pch = 20)

plot(backward_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(backward_summary$adjr2), backward_summary$adjr2[which.max(backward_summary$adjr2)], col = "red", cex = 2, pch = 20)

These methods also indicate that a three-variable model is our best option.

HW-2.1.e Use the predict() function to investigate the test performance in RMSE using your “best model”.

Code
# INSERT CODE 

# Select the best model for each
backward_best_index <- which.max(backward_summary$adjr2)
forward_best_index <- which.max(forward_summary$adjr2)

# Best variables
backward_best_vars <- names(coef(backward_model, id = backward_best_index))
forward_best_vars <- names(coef(forward_model, id = forward_best_index))
# Without intercept
backward_best_vars <- backward_best_vars[backward_best_vars != "(Intercept)"]
forward_best_vars <- forward_best_vars[forward_best_vars != "(Intercept)"]

# formulas so I can fit lm
backward_formula_str <- paste("HHV ~", paste(backward_best_vars, collapse=" + "))
forward_formula_str <- paste("HHV ~", paste(forward_best_vars, collapse=" + "))

backward_formula <- as.formula(backward_formula_str)
forward_formula <- as.formula(forward_formula_str)

# Fit model
backwards_best_model_fit <- lm(backward_formula, data=training_set)
forwards_best_model_fit <- lm(forward_formula, data=training_set)

# preds
predictions_back <- predict(backwards_best_model_fit, newdata=testing_set)
predictions_forw <- predict(forwards_best_model_fit, newdata=testing_set)

# RMSE
rmse_back <- sqrt(mean((testing_set$HHV - predictions_back)^2, na.rm = TRUE))
rmse_forw <- sqrt(mean((testing_set$HHV - predictions_forw)^2, na.rm = TRUE))

# RMSE
print(paste("Backward RMSE:", rmse_back))
[1] "Backward RMSE: 1.43045828417102"
Code
print(paste("Forward RMSE:", rmse_forw))
[1] "Forward RMSE: 1.43045828417102"
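The two RMSEs are identical, presumably because forward and backward selection settled on the same predictor set. The inline formula can also be wrapped in a small helper (a sketch; the function name is my own):

```r
# RMSE helper, equivalent to the sqrt(mean(...)) expressions above
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}
rmse(c(1, 2, 3), c(1, 2, 5))            # sqrt(4/3), about 1.155
```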

HW-2.2: Loan applications

Create your own cross validation algorithm to predict the interest rate for loan applications

Lending Club gained fame for being one of the first major players in retail lending. The dataset lending_club in the modeldata package includes 9,857 loans that were provided. Interest rates of loans are often a good indicator of the level of risk associated with lending. If you are likely to pay back a loan, you will likely be charged lower interest than someone who has a higher chance of default. Your goal is to determine the best model for predicting the interest rate charged to borrowers using best, forward, and backward subset selection within a five-fold cross-validation framework.

Prep steps: - drop all rows with missing data in the following columns

HW-2.2.a Create a correlation plot of all the numeric variables in the dataset using the corrplot package to create a high quality graph, then comment on your findings

Code
# INSERT CODE
data("lending_club", package = "modeldata")
loan <- lending_club
numeric_vars <- select_if(loan, is.numeric)

cor_matrix <- cor(loan[, sapply(loan, is.numeric)])

corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45, 
         addCoef.col = "black",
         diag = FALSE)

While there is some correlation among the other variables, the interest rate is only weakly correlated with the rest, the strongest being with all_util at 0.29. All the other correlations fall within ±0.25, which is not very strong. This may make building a good model more complicated, but it can still be done (up to a point).

HW-2.2.b Run best, forward, and backward subset selection on the entire dataset comment on the findings

We will keep only the numeric variables (dropping state, among others), as the professor indicated this is acceptable: including the categorical variables, even one-hot encoded, would make the search run far too long.

Code
loan <- na.omit(lending_club)

numeric_loan <- loan %>% select_if(is.numeric)
Code
# INSERT CODE 
# Best
best_subset_fit <- regsubsets(int_rate ~ ., data=numeric_loan, nvmax=NULL, method="exhaustive", really.big = T)

summary_best <- summary(best_subset_fit)

#summary_best

par(mfrow = c(2, 2))

plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_best$cp), summary_best$cp[which.min(summary_best$cp)], col = "red", cex = 2, pch = 20)

plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_best$bic), summary_best$bic[which.min(summary_best$bic)], col = "red", cex = 2, pch = 20)

plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_best$adjr2), summary_best$adjr2[which.max(summary_best$adjr2)], col = "red", cex = 2, pch = 20)

Code
# INSERT CODE 
# Forward

forward_fit <- regsubsets(int_rate ~ ., data=numeric_loan, method="forward", nvmax=NULL)

summary_forward <- summary(forward_fit)

#summary_forward

par(mfrow = c(2, 2))

plot(summary_forward$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_forward$cp), summary_forward$cp[which.min(summary_forward$cp)], col = "red", cex = 2, pch = 20)

plot(summary_forward$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_forward$bic), summary_forward$bic[which.min(summary_forward$bic)], col = "red", cex = 2, pch = 20)

plot(summary_forward$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_forward$adjr2), summary_forward$adjr2[which.max(summary_forward$adjr2)], col = "red", cex = 2, pch = 20)

Code
# INSERT CODE 
# backward

backward_fit <- regsubsets(int_rate ~ ., data=numeric_loan, method="backward", nvmax=NULL)

summary_backward <- summary(backward_fit)

#summary_backward

par(mfrow = c(2, 2))

plot(summary_backward$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_backward$cp), summary_backward$cp[which.min(summary_backward$cp)], col = "red", cex = 2, pch = 20)

plot(summary_backward$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_backward$bic), summary_backward$bic[which.min(summary_backward$bic)], col = "red", cex = 2, pch = 20)

plot(summary_backward$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_backward$adjr2), summary_backward$adjr2[which.max(summary_backward$adjr2)], col = "red", cex = 2, pch = 20)

All three criteria roughly agree that ten variables is best. While adding further variables improves some of the criteria slightly, the gain is so small that we can settle on ten.

HW-2.2.c Create a five-fold cross-validation algorithm using for loops to compare the CV mse performance of your best two models

Code
library(boot)

Attaching package: 'boot'
The following object is masked from 'package:lattice':

    melanoma
Code
# INSERT CODE 

# Selecting the model based on Cp, as its penalty strength sits between BIC and adjusted R^2
best_bic_backward_index <- which.min(summary_backward$cp)
best_vars_backward <- names(coef(backward_fit, id = best_bic_backward_index))

best_bic_best_subset_index <- which.min(summary_best$cp)
best_vars_best_subset <- names(coef(best_subset_fit, id = best_bic_best_subset_index))

backward_formula <- as.formula(paste("int_rate ~", paste(best_vars_backward[-1], collapse = " + ")))
best_subset_formula <- as.formula(paste("int_rate ~", paste(best_vars_best_subset[-1], collapse = " + ")))
Code
# INSERT CODE 
set.seed(4444)

folds <- cut(seq(1, nrow(numeric_loan)), breaks=5, labels=FALSE)
cv_mse_backward <- numeric(5)
cv_mse_best_subset <- numeric(5)

for(i in 1:5){
  test_indices <- which(folds == i)
  train_indices <- setdiff(seq_len(nrow(numeric_loan)), test_indices)
  
  train_set <- numeric_loan[train_indices, ]
  test_set <- numeric_loan[test_indices, ]
  # Fit both models
  fit_backward <- glm(backward_formula, data = train_set)
  fit_best_subset <- glm(best_subset_formula, data = train_set)
  # Preds of both
  predictions_backward <- predict(fit_backward, newdata = test_set)
  predictions_best_subset <- predict(fit_best_subset, newdata = test_set)
  # MSE
  cv_mse_backward[i] <- mean((test_set$int_rate - predictions_backward)^2, na.rm = TRUE)
  cv_mse_best_subset[i] <- mean((test_set$int_rate - predictions_best_subset)^2, na.rm = TRUE)
}

# Avg MSE
avg_cv_mse_backward <- mean(cv_mse_backward)
avg_cv_mse_best_subset <- mean(cv_mse_best_subset)

print(paste("Average CV MSE for Backward Selection:", avg_cv_mse_backward))
[1] "Average CV MSE for Backward Selection: 17.5556682537638"
Code
print(paste("Average CV MSE for Exhaustive Selection:", avg_cv_mse_best_subset))
[1] "Average CV MSE for Exhaustive Selection: 17.5556682537638"
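One caveat about the loop above: cut(seq(1, nrow(numeric_loan)), breaks = 5) assigns folds in row order, so if the rows are sorted in any meaningful way the folds are not random. A shuffled assignment (a sketch on a small n) avoids this:

```r
# Randomized fold labels: shuffle a balanced 1..5 pattern across the rows
set.seed(4444)
n <- 20                                  # small n for illustration
folds <- sample(rep(1:5, length.out = n))
table(folds)                             # each fold gets n / 5 = 4 rows
```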

HW-2.4: Advertising budget

Using cross-validation to select best advertising budget

In this problem, we use the Advertising data download here. We want to predict Sales from TV, Radio and Newspaper, using multiple regression with all three predictors plus up to one interaction term of these three predictors, e.g. TV * Radio or Radio * Newspaper.

HW-2.4.a Should such an interaction term be included? Which one? Try to answer this question by estimating the residual standard error using 10-fold cross validation for all four possible models.

Possible models:

  • Sales \(\sim\) TV + Radio + Newspaper

  • Sales \(\sim\) TV + Radio + Newspaper + TV:Radio

  • Sales \(\sim\) TV + Radio + Newspaper + TV:Newspaper

  • Sales \(\sim\) TV + Radio + Newspaper + Radio:Newspaper

Code
df <- read_csv('https://www.statlearning.com/s/Advertising.csv')
New names:
Rows: 200 Columns: 5
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(5): ...1, TV, radio, newspaper, sales
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
# list possible models
models <- list(
  model1 = sales ~ TV + radio + newspaper,
  model2 = sales ~ TV + radio + newspaper + TV:radio,
  model3 = sales ~ TV + radio + newspaper + TV:newspaper,
  model4 = sales ~ TV + radio + newspaper + radio:newspaper
)

# for storing results
cv_results <- vector("list", length(models))
names(cv_results) <- c("Base", "TV:Radio", "TV:Newspaper", "Radio:Newspaper")

# loop around possible models
for (i in seq_along(models)) {
  glm_model_selected <- glm(formula = models[[i]], data = df)
  cv_result <- cv.glm(df, glm_model_selected, K = 10)
  
  cv_results[[i]] <- list(cv_error = cv_result$delta[1])
}

# Results fixing diplay (look like a data frame / tibble)
cv_results_df <- bind_rows(cv_results, .id = "Model")

# Print the results
print(cv_results_df)
# A tibble: 4 × 2
  Model           cv_error
  <chr>              <dbl>
1 Base               2.93 
2 TV:Radio           0.943
3 TV:Newspaper       2.89 
4 Radio:Newspaper    3.06 

The TV:radio interaction should be included, as its cross-validated error is substantially lower than that of the other models.
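Since the question asks for the residual standard error while cv.glm() reports the CV estimate of MSE in delta[1], the errors in the table can be put on the RSE scale by taking square roots; the ranking is unchanged:

```r
# CV MSEs from the table above, converted to the RMSE / RSE scale
cv_mse <- c(Base = 2.93, TV_Radio = 0.943, TV_Newspaper = 2.89,
            Radio_Newspaper = 3.06)
round(sqrt(cv_mse), 2)                   # 1.71, 0.97, 1.70, 1.75
```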

HW-2.4.b Create a single plot showing the return on investment of each advertising method where the y-axis is Sales and the x-axis is advertising dollars. There should be three lines, one for each method. The slope is the coefficient from you regression. What is the best advertising method to invest in based on return on investment?

Code
# INSERT CODE

# Models
model_tv <- lm(sales ~ TV, data = df)
model_radio <- lm(sales ~ radio, data = df)
model_newspaper <- lm(sales ~ newspaper, data = df)

coeff_tv <- coef(model_tv)
coeff_radio <- coef(model_radio)
coeff_newspaper <- coef(model_newspaper)

ad_dollars <- seq(from = 0, to = max(df$TV, df$radio, df$newspaper), by = 1)

sales_tv <- coeff_tv[1] + coeff_tv[2] * ad_dollars
sales_radio <- coeff_radio[1] + coeff_radio[2] * ad_dollars
sales_newspaper <- coeff_newspaper[1] + coeff_newspaper[2] * ad_dollars

plot_data <- tibble(
  AdvertisingDollars = c(ad_dollars, ad_dollars, ad_dollars),
  sales = c(sales_tv, sales_radio, sales_newspaper),
  Method = factor(rep(c("TV", "radio", "newspaper"), each = length(ad_dollars)))
)

# Plot
ggplot(plot_data, aes(x = AdvertisingDollars, y = sales, color = Method)) +
  geom_line() +
  labs(title = "ROI on advertisement method",
       x = "Advertising $",
       y = "Sales",
       color = "Method") +
  my_theme()

Radio appears to offer the best return on investment, so we should recommend investing more in radio advertisements.

HW-2.5: ISLR-6.8 #8(a-d)

Part-A

Code
# INSERT CODE
# Note: no seed is set here, so the exact variables selected below
# will vary from run to run
X <- rnorm(100)
noise <- rnorm(100)

Part-B

Code
# INSERT CODE
B_0 <- 5
B_1 <- 5
B_2 <- 4
B_3 <- 4

Y <- B_0 + B_1 * X + B_2 * X^2 + B_3 * X^3 + noise
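As a quick check of the data-generating process, refitting the true cubic form should recover coefficients near (5, 5, 4, 4). This sketch draws its own data with a fixed seed, since the chunk above does not set one:

```r
# Re-simulate with a fixed seed and confirm lm() recovers the true cubic
set.seed(1)
x <- rnorm(100)
y <- 5 + 5 * x + 4 * x^2 + 4 * x^3 + rnorm(100)
round(coef(lm(y ~ poly(x, 3, raw = TRUE))), 1)  # close to 5, 5, 4, 4
```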

Part-C

Code
# INSERT CODE

ex_5_df <- tibble(X, Y)

model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df)

summary <- summary(model)

CP = which.min(summary$cp)
BIC = which.min(summary$bic)
R2 = which.max(summary$adjr2)

best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))

cat("Best subsets:","\n")
Best subsets: 
Code
cat("Cp:", paste(best_vars_CP[-1], collapse = ", "), "\n","\n")
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10 
 
Code
cat("BIC:", paste(best_vars_BIC[-1], collapse = ", "), "\n","\n")
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3 
 
Code
cat("Adjusted R^2:", paste(best_vars_R2[-1], collapse = ", "), "\n","\n")
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10 
 
Code
# INSERT CODE
par(mfrow = c(2, 2))

plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)

plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)

plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)

Part-D:

Forward selection

Code
# INSERT CODE
model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df, method = "forward")

summary <- summary(model)

CP = which.min(summary$cp)
BIC = which.min(summary$bic)
R2 = which.max(summary$adjr2)

best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))

cat("Best subsets:","\n")
Best subsets: 
Code
cat("Cp:", paste(best_vars_CP[-1], collapse = ", "), "\n","\n")
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3 
 
Code
cat("BIC:", paste(best_vars_BIC[-1], collapse = ", "), "\n","\n")
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3 
 
Code
cat("Adjusted R^2:", paste(best_vars_R2[-1], collapse = ", "), "\n","\n")
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3 
 
Code
# INSERT CODE
par(mfrow = c(2, 2))

plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)

plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)

plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)

Backward selection

Code
# INSERT CODE
model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df, method = "backward")

summary <- summary(model)

CP = which.min(summary$cp)
BIC = which.min(summary$bic)
R2 = which.max(summary$adjr2)

best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))

cat("Best subsets:","\n")
Best subsets: 
Code
cat("Cp:", paste(best_vars_CP[-1], collapse = ", "), "\n","\n")
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10 
 
Code
cat("BIC:", paste(best_vars_BIC[-1], collapse = ", "), "\n","\n")
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10 
 
Code
cat("Adjusted R^2:", paste(best_vars_R2[-1], collapse = ", "), "\n", "\n")
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10 
 
Code
# INSERT CODE
par(mfrow = c(2, 2))

plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)

plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)

plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)

While the backward method (like the Cp and adjusted R^2 criteria in part c) suggests a larger six-variable model, BIC in part c and all three criteria under forward selection agree on the three-variable model. I would therefore pick the three-variable model: it performs almost as well as the larger models, and it matches the cubic form we used to simulate the data in part b.